The likelihood ratio (LR) measures the relative weight of forensic
data regarding two hypotheses. Several levels of uncertainty arise
if frequentist methods are chosen for its assessment: the assumed
population model only approximates the true one, and its parameters
are estimated from a limited database. Moreover, it may
be wise to discard part of the data, especially the part that is only
indirectly related to the hypotheses. Different reductions define different LRs.
Therefore, it is more sensible to talk about “a” LR instead of “the”
LR, and the error involved in the estimation should be quantified.

Two frequentist methods are proposed in the light of these points 
for the ‘rare type match problem’, that is when a match between the 
perpetrator’s and the suspect’s DNA profile, never observed before 
in the database of reference, is to be evaluated.

1 Introduction

One of the main challenges of forensic science is to evaluate how much some evidence can 
be helpful to discriminate between hypotheses of interest. For instance, a typical piece 
of evidence may be a DNA trace which is found at the crime scene and whose profile 
matches a known suspect’s DNA profile. A pair of mutually exclusive hypotheses is
typically defined, of the kind ‘the crime stain came from the suspect’ ($h_p$) and ‘the
crime stain came from an unknown donor’ ($h_d$). The widely accepted method to perform
this evaluation is the calculation of the likelihood ratio, a statistic that expresses the relative
plausibility of the observations under the two hypotheses (Robertson and Vignaux, 1995;
Evett and Weir, 1998; Aitken and Taroni, 2004; Balding, 2005; Steele and Balding, 2014).

The definition of the likelihood ratio depends on whether a Bayesian or a frequentist 
approach is chosen. In the Bayesian context, once a pair of hypotheses is given, the
likelihood ratio is defined as

$$\mathrm{LR} = \frac{\Pr(D = d \mid H = h_p)}{\Pr(D = d \mid H = h_d)}, \tag{1}$$

where Pr is the Bayesian probability, reflecting the expert’s belief about the joint distribution
of the random variables of the model, namely $D$ (representing the data), $H$ (representing
the hypotheses), and $\Theta$ (the nuisance parameter(s)).

On the other hand, in a frequentist context, the nuisance parameter $\theta$ and the
hypotheses $h$ are considered to be fixed (unknown) quantities. The frequentist probability
(here denoted as $\mathcal{P}r$) can be expressed in terms of the Bayesian Pr in the following way:
$\mathcal{P}r_\theta(\cdot \mid h) := \Pr(\cdot \mid \Theta = \theta, H = h)$, $\forall h$. The frequentist likelihood ratio can thus be
expressed as

$$\mathrm{Lr}_\theta = \frac{\mathcal{P}r_\theta(D = d \mid h_p)}{\mathcal{P}r_\theta(D = d \mid h_d)}. \tag{2}$$

It is important to consider that different reductions of the data D can be carried out, 
each corresponding to a different frequentist likelihood ratio. Moreover, unless we choose 
nonparametric solutions, a model choice is also performed, and there are often parameters 
to be estimated. Hence, two further levels of uncertainty have to be added to the initial 
uncertainty regarding which hypothesis is the true one. 

The main aim of this paper is to convey the message that, if a frequentist approach
is chosen and an estimation is needed, (i) it is more sensible to talk about “a” likelihood
ratio instead of “the” likelihood ratio, and (ii) a quantification of the error involved in the
estimation of the likelihood ratio is to be provided along with the estimated value.

It is widely believed in the forensic field that the use of frequentist methods to assess the
likelihood ratio is not coherent, since the likelihood ratio has to be used within the context of
Bayes’ theorem, as the way to update prior odds to posterior odds. However, frequentists
may be interested in the likelihood ratio as well, seen as a tool to measure the evidential
value of data, independently of Bayes’ theorem. Moreover, the literature presents many
approaches to calculating the likelihood ratio that are wrongly described as Bayesian, and which in fact
plug Bayes estimates into a likelihood ratio defined in a frequentist way (for a
discussion, see Cereda, 2015). We thus believe that it is important to study and discuss the
two approaches (the Bayesian and the frequentist) separately, in order to define coherent
methodologies and avoid unnecessary hybrid methods. This is done in Section 2.

In forensic science, a very challenging problem is the so-called rare type match, the
situation in which there is a match between the characteristics of some recovered material
and the corresponding characteristics of the control material, but these characteristics
have not yet been observed in previously collected samples (i.e., they do not occur in
any existing database of interest for the case). This constitutes a problem because of the
presence of a nuisance parameter that is (related to) the proportion of individuals (or items)
possessing the matching characteristic in a reference population: this proportion is, in
standard frequentist practice, estimated using the relative frequency of the characteristic
in a previously collected database. Thus, in the case of a rare type match, different
solutions are needed.

This paper discusses two frequentist methods to provide a likelihood ratio in the rare
type match case, based respectively on the parametric discrete Laplace method (Andersen
et al., 2013b), and on a generalization of the nonparametric Good-Turing estimator (Good,
1953). The latter looks similar to Brenner’s ‘κ method’ (Brenner, 2010), but differs
inasmuch as it does not need any assumption and provides two different frequencies, one for
the prosecution’s and one for the defense’s point of view. We plan to compare the two
methods in a future paper.

More specifically, these two methods are here proposed as an answer to the problem
of the rare Y-STR haplotype match: the situation in which the matching (and previously
unseen) characteristic is a Y-STR profile. Each of the two methods is analyzed in the
light of points (i) and (ii) discussed above, by carefully specifying the data reduction and the
chosen probability model, and by discussing the different levels of error involved in
the estimations.

Sections 3 and 4 develop in depth the rationale behind points (i) and (ii) above.
Section 5 describes the paradigmatic example of the rare Y-STR haplotype match problem,
to which we will apply the discrete Laplace method (Section 6) and the Generalized-Good
method (Section 7), according to the guidelines set out in the opening sections.

2 Bayesian versus frequentist approach to likelihood ratio assessment

The task of a forensic statistician is to measure the extent to which some given data favors
one hypothesis over the other. For instance, the data at disposal may consist of
a DNA trace found at the crime scene which matches a suspect’s DNA profile, and of
a database of collected DNA profiles from a reference population or past cases. This is
a paradigmatic example to which, from now on, we will refer generically as “the DNA
example”. The prosecution and defense hypotheses are usually of the kind “the trace has
been left by the suspect” ($h_p$) and “the trace has been left by an unknown person” ($h_d$).
Denote with $h \in \{h_d, h_p\}$ the unknown true hypothesis, and with $\theta$ the nuisance parameter
involved in the assessment of the likelihood ratio. In the DNA example, the vector made
of all the DNA frequencies can be thought of as the nuisance parameter $\theta$. Notice that
there is a difference between $h$ and $\theta$: one ($h$) is the parameter which we ‘test’ through
the likelihood ratio, the other ($\theta$) is a nuisance parameter involved in the likelihood ratio
assessment. It is often possible to split the data $D$ into $E$, evidence directly related to
the crime, and $B$, additional information not related to the crime and only pertaining to
the nuisance parameter $\theta$. In the DNA example, we can take as $E$ the pair of matching
profiles (that of the trace and that of the suspect) and as $B$ the database of reference. $D$,
$E$, and $B$ can be regarded as random variables, such that $D = (E, B)$.

Bayesian and frequentist methods differ in how they consider the parameters $\theta$ and $h$.
In a Bayesian context they are modelled through random variables $\Theta$ and $H$, which are
given prior distributions $p(\theta)$ and $p(h)$. Frequentists consider them as fixed (i.e., without
distribution) unknown quantities. Regardless of the type of approach which is chosen,
some model assumptions concerning $E$, $B$, $\theta$ and $h$ can be made:

a. The distribution of $B$ given $h$ and $\theta$ only depends on $\theta$.

b. $B$ is independent of $E$, given $h$ and $\theta$.

In the DNA example, condition a holds if for instance the database is collected before
the crime, since the sampling mechanism to obtain the database of reference is independent
of which hypothesis is correct. Condition b holds if the suspect has been found on the
grounds of different evidence that has nothing to do with DNA.

2.1 The Bayesian approach 



Figure 1: Bayesian network representing the dependency relations between $E$ (evidence of
the case), $B$ (background data), $\Theta$ (nuisance parameter), and $H$ (hypotheses of interest).

A full Bayesian model is defined by giving the prior joint probability distribution for
all the random variables of the model (here $E$, $B$, $H$ and $\Theta$). It can be represented by
the Bayesian network of Figure 1, which is in turn equivalent to the following Bayesian
reformulation of conditions a and b, with a third additional condition:

Bayesian a. $B$ is conditionally independent of $H$ given $\Theta$.

Bayesian b. $B$ is conditionally independent of $E$ given $\Theta$ and $H$.

Bayesian c. $\Theta$ is unconditionally independent of $H$.

Condition Bayesian c is guaranteed for instance if prior beliefs on $\theta$ and on $h$ are assessed
by people with different responsibilities and tasks: a judge for $h$ and a forensic DNA
expert (or a statistician) for $\theta$. The joint prior can be factorized as follows, by looking at
the structure of the Bayesian network or, equivalently, using the three conditions above:
$p(\theta, h, b, e) = p(\theta)\, p(h)\, p(b \mid \theta)\, p(e \mid \theta, h)$. By choosing a prior distribution for $\theta$ and $h$ which
reflects the expert’s beliefs, the Bayesian probability is an expression of the subjective credence
of the experts. The distribution of all other variables given $\theta$ and $h$ is defined by the
structure of the model, and needs no subjective assessment.

The Bayesian likelihood ratio can be derived in the following way:

$$
\begin{aligned}
\mathrm{LR} &= \frac{\Pr(E = e, B = b \mid H = h_p)}{\Pr(E = e, B = b \mid H = h_d)}
= \frac{\Pr(E = e \mid B = b, H = h_p)}{\Pr(E = e \mid B = b, H = h_d)}
= \frac{\int p(e \mid b, h_p, \theta)\, p(\theta \mid b, h_p)\, d\theta}{\int p(e \mid b, h_d, \theta)\, p(\theta \mid b, h_d)\, d\theta} \\
&= \frac{\int \theta\, p(\theta \mid b)\, d\theta}{\int \theta^2\, p(\theta \mid b)\, d\theta}
= \frac{E(\Theta \mid B = b)}{E(\Theta^2 \mid B = b)}.
\end{aligned}
$$

Some simplifications have been carried out because of conditions a, b, and c. Since it
is possible to marginalize over all values of $\Theta$, using its distribution, there is no need
to estimate the likelihood ratio, or to account for uncertainties, if a proper full Bayesian
approach is chosen.
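
As a toy illustration of the final ratio of posterior moments, the following sketch assumes (this is not taken from the paper) that $\theta$ is the population frequency of the matching profile and that its posterior given the database is a Beta distribution with hypothetical parameters `alpha` and `beta`. Under that assumption the two moments have closed forms and the ratio simplifies to $(\alpha + \beta + 1)/(\alpha + 1)$.

```python
from fractions import Fraction


def bayesian_lr_beta(alpha, beta):
    """E(theta | b) / E(theta^2 | b) when theta | b ~ Beta(alpha, beta).

    The Beta posterior is an illustrative assumption, not the paper's model.
    """
    alpha, beta = Fraction(alpha), Fraction(beta)
    first_moment = alpha / (alpha + beta)
    second_moment = alpha * (alpha + 1) / ((alpha + beta) * (alpha + beta + 1))
    return first_moment / second_moment


# Hypothetical case: uniform Beta(1, 1) prior, database of 100 profiles,
# none of which matches, giving a Beta(1, 101) posterior.
lr = bayesian_lr_beta(1, 101)
print(lr, float(lr))
```

No estimate of $\theta$ is plugged in here: the uncertainty about $\theta$ is integrated out through the posterior, which is the point made in the paragraph above.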

In the rest of the paper we only focus on frequentist methods to solve the rare Y-STR 
haplotype match problem, but a companion paper presents a similar study on Bayesian 
methods (Cereda, 2015). 

2.2 The frequentist perspective 

The difference between frequentist and Bayesian methods concerns the parameters $\theta$ and $h$: for a
frequentist they are fixed quantities, whose values correspond to, respectively, the unknown
true value of $\theta$ and the correct hypothesis. One can see frequentist models as Bayesian
models where the distributions chosen for $\Theta$ and $H$ give probability one to the values $\theta$ and $h$,
respectively. Also, one can express the frequentist probability $\mathcal{P}r$ in terms of the Bayesian
probability Pr, in the following way: $\mathcal{P}r(\cdot \mid h) := \mathcal{P}r_\theta(\cdot \mid h) = \Pr(\cdot \mid H = h, \Theta = \theta)$. For
frequentist statisticians, there is a true, ‘physical’ probability which governs the situation
at hand: according to the prosecution this true probability is $\mathcal{P}r_\theta(\cdot \mid h_p)$, while according
to the defense it is $\mathcal{P}r_\theta(\cdot \mid h_d)$, with $\theta$ set to its true (unknown) value.

Conditions a and b can be rephrased in frequentist language as:

Frequentist a. $\mathcal{P}r_\theta(B = b \mid h_p) = \mathcal{P}r_\theta(B = b \mid h_d)$, for all $\theta$ and $b$.

Frequentist b. $\mathcal{P}r_\theta(E = e \mid B = b, h) = \mathcal{P}r_\theta(E = e \mid h)$, for all $\theta$, $h$, $e$, and $b$.

It holds that:

$$
\frac{\mathcal{P}r(D = d \mid h_p)}{\mathcal{P}r(D = d \mid h_d)}
= \frac{\mathcal{P}r(E = e, B = b \mid h_p)}{\mathcal{P}r(E = e, B = b \mid h_d)}
= \frac{\mathcal{P}r(E = e \mid B = b, h_p)\, \mathcal{P}r(B = b \mid h_p)}{\mathcal{P}r(E = e \mid B = b, h_d)\, \mathcal{P}r(B = b \mid h_d)}.
$$

The index $\theta$ has been omitted for ease of notation. Thanks to conditions Frequentist
a and b, the likelihood ratio can be expressed as

$$\frac{\mathcal{P}r(E = e \mid h_p)}{\mathcal{P}r(E = e \mid h_d)}. \tag{3}$$


Even though the two alternative ways of writing the likelihood ratio expressed by
equations (2) and (3) are theoretically different, and mean two different things, they have
the same value. This implies that part of the information, namely $B$, is not useful to
discriminate between the two hypotheses of interest. Stated otherwise, once $\theta$ is known,
$B$ is irrelevant to determine the likelihood ratio, i.e. to decide about parameter $h$. However,
it may play an important role in the estimation of parameter $\theta$. For instance, getting back
to the DNA example, the database ($B$) is often useful to estimate the frequencies of the
different haplotypes.

Notice that, in order for (3) to hold, condition b can be weakened to:

Frequentist b*.

$$
\frac{\mathcal{P}r_\theta(E = e \mid B = b, h_p)}{\mathcal{P}r_\theta(E = e \mid B = b, h_d)}
= \frac{\mathcal{P}r_\theta(E = e \mid h_p)}{\mathcal{P}r_\theta(E = e \mid h_d)},
\quad \text{for all } e, b, \text{ and } \theta,
$$

which amounts to requiring that conditioning on the observation of $B$ does not change
the likelihood ratio for the observation of $E$.

Furthermore, while conditions a and b* imply (3), the converse is not true. Formulation
(3) is instead equivalent to a weaker condition, that is:

Frequentist c. $\mathcal{P}r_\theta(B = b \mid E = e, h_p) = \mathcal{P}r_\theta(B = b \mid E = e, h_d)$, for all $\theta$.

This can be seen by the following alternative development of the likelihood ratio ($\theta$
omitted):

$$
\frac{\mathcal{P}r(D = d \mid h_p)}{\mathcal{P}r(D = d \mid h_d)}
= \frac{\mathcal{P}r(B = b \mid E = e, h_p)\, \mathcal{P}r(E = e \mid h_p)}{\mathcal{P}r(B = b \mid E = e, h_d)\, \mathcal{P}r(E = e \mid h_d)}. \tag{4}
$$

It follows that:

$$
\text{c} \iff \mathrm{LR} = \frac{\mathcal{P}r(E = e \mid h_p)}{\mathcal{P}r(E = e \mid h_d)}. \tag{5}
$$

Notice that frequentists use a likelihood ratio $\mathrm{Lr}_\theta$, which can be written in terms of the
Bayesian LR as $\mathrm{LR} \mid \Theta = \theta$ (read “LR given $\theta$”), and attempt to get close to $\theta$ by choosing
some estimator $\hat{\theta}$. This leads to the so-called plug-in estimator $\widehat{\mathrm{Lr}}_\theta = \mathrm{Lr}_{\hat{\theta}} = \mathrm{LR} \mid (\Theta = \hat{\theta})$.
However, that is not the only option, as we will see for the method explained in Section 7.

It is important to notice that the frequentist approach may be represented by the
same Bayesian network of Figure 1, where the states of nodes $\Theta$ and $H$ are instantiated
to particular values $\theta$ and $h$, respectively. This shows that the two approaches actually
do not disagree on the structure of the model regarding $E$ and $B$. Bayesians merely add
ingredients to the model by allowing $\Theta$ and $H$ to have a distribution. Stated otherwise,
the Bayesian approach is defined by the very same frequentist conditions a and b, with
the addition of condition c about the independence of $\Theta$ and $H$.

3 Data reduction 

Let us denote with $\mathcal{D}$ all the data given to the expert in the form of a dossier, which the expert
has to “translate” into a well-defined mathematical object. Evaluating the entirety of
the data at the expert’s disposal is often unrealistic; hence the need to reduce
$\mathcal{D}$ to something less informative, but more feasible to evaluate, which we denote as
$D$. Often the database contains only information about a limited number of loci, and this
implies that information about other loci of the crime stain cannot be used. This already
constitutes a first reduction of the data. Other kinds of reductions are performed in order to
gain in terms of precision of the estimates. Especially in a situation with many nuisance
parameters, it can be wise to discard the part of the data which primarily tells us about
the nuisance parameters, and only indirectly about the ultimate question of interest (i.e.,
which hypothesis is more likely to be true). In fact, it could be very wise to reduce the
data $\mathcal{D}$ to a much smaller amount of information, because the likelihood ratio based on
the reduced data is much more precisely estimated than one based on all the data. However,
there is a limit to this, since the reduction of $\mathcal{D}$ into $D$ comes at a cost: the stronger the
reduction, the less the corresponding likelihood ratio value discriminates between the two
hypotheses, because less information is less powerful for that purpose. We have to strike
a compromise between a gain in terms of precision and a loss in strength of the evidence.
This will be discussed in more detail in Section 8.

Once a particular reduction $D$ has been defined, the frequentist likelihood ratio (Lr)
can be defined as in (2). It is easy to understand that there is no unique way to reduce
$\mathcal{D}$ and that each choice entails the definition of a different likelihood ratio. For instance,
in the DNA example, one can consider a profile made of more or fewer loci. Another
kind of reduction will be presented in Section 7. Different choices of $D \subset \mathcal{D}$ lead to
different likelihood ratios. Therefore, it is better to refer to “a” likelihood ratio instead of
to “the” likelihood ratio. This was already stated in Dawid (2001), even though regarding
hypotheses instead of data.

In the literature different choices of $D \subset \mathcal{D}$ and $\mathcal{P}r$ are proposed, each corresponding
to a different likelihood ratio to be estimated. These choices are often only implicit, and
one of the aims of this research is to make explicit the reduction which corresponds to two
selected methods, by looking for the corresponding $E$ and $B$.

4 Different levels of uncertainty


The likelihood ratio measures the relative strength of support given by the data to a
hypothesis over an alternative. Clearly, it is useful when there is uncertainty about which
of the two hypotheses is true (to be more precise, it may also be the case that neither of
the alternatives is correct, and the likelihood ratio continues to be meaningful). Along
with this first basic uncertainty about the state of affairs, two more levels of
uncertainty arise in the attempt to calculate the likelihood ratio.

For a frequentist statistician, the likelihood ratio is a ratio of probabilities usually based
on a model $\mathcal{M}$ which is at best only a good approximation to the truth. Moreover, the
parameters of that model have to be estimated by fitting it to the data in some database.
Stated otherwise, after a particular choice of the data $D$ to be considered, a
population model is to be chosen and its parameters estimated using a limited sample.
Some forensic literature (Morrison, 2010; Stoel and Sjerps, 2012; Curran et al., 2002;
Curran, 2005) has already pointed out the necessity of uncertainty assessment in likelihood
ratio estimation, even though without differentiating among levels. On the other hand,
for a true Bayesian statistician there is no need for estimation, and no additional levels of
uncertainty to be added, since the definition of the Bayesian Pr already includes not only
beliefs about chances when picking people from that population, but also beliefs about
parameters of the models, and beliefs about models.

This discussion may hopefully put an end to the debate as to whether it makes sense
to talk about ‘estimation’ and ‘uncertainty assessment’ for the likelihood ratio. Stoel
and Sjerps (2012) believe that “there are strong arguments for the notion of a “true”
but unknown value of the likelihood ratio, given the relevant hypotheses and background
information, and that it is important to consider the uncertainty. Ignoring the uncertainty
can be strongly misleading”. This point of view is also shared in Sjerps et al. (2016). On
the other hand, talking about estimation of the likelihood ratio is described as “internally
inconsistent, and hence misconceived” by Taroni et al. (2015). Both points of view are
correct, if correctly put into context: if a frequentist approach is chosen it is sensible to
talk about ‘estimation’ and to deal with uncertainty assessment. On the other hand, in a
full Bayesian context, they are misplaced.

Notice that Bayesianism is theoretically a very powerful interpretation of probability,
but when it comes to applying Bayesian theory for practical purposes, even the most
fervent Bayesian has to strike a balance between what is feasible and what is theoretically
right and coherent according to the Bayesian perspective. Such a Bayesian typically chooses a particular
model as the correct one (as frequentists do), and/or has to put convenient (rather than
realistic) prior distributions on the parameters. Hence, whether Bayesian or frequentist
approaches are chosen, the attempt to produce the likelihood ratio leads to several levels
of uncertainty which should be accounted for.

We will now discuss the two additional levels of uncertainty mentioned before. The
second level of uncertainty pertains to the choice of a particular population model, which
is only an approximation of the truth. This level of uncertainty may be reduced using
nonparametric methods, which are based on fewer assumptions.

Given a particular population model, the third level of uncertainty pertains to the fact
that the population parameters are not known. This may involve estimation of parameters
(as in the discrete Laplace method of Section 6) or the direct estimation of the
probabilities of interest (as in the Generalized-Good method described in Section 7), and
the quality of the estimates depends heavily on the size of the available databases. This
level of uncertainty pertains both to parametric and nonparametric methods.

The evidential value reported depends on all the levels of uncertainty which afflict 
the estimation of the likelihood ratio. Thus, it is of the utmost importance to report 
the likelihood ratio value along with (1) an explicit definition of which data D we want to 
evaluate through that likelihood ratio, and (2) a discussion (and if possible a quantification) 
of the levels of uncertainty that afflict the reported value. 

4.1 Estimating the weight of evidence 

Instead of estimating the likelihood ratio, it is more sensible to directly estimate its
logarithm, sometimes called relevance ratio or weight of evidence (Good, 1950; Aitken et al.,
1998; Aitken and Taroni, 2004). This is because likelihood ratio values are interpreted
in terms of orders of magnitude (powers of 10), and when a value is reported, it is important
to control the relative error, rather than the absolute error. In fact, the former is meaningful
in itself while the latter depends on the particular value of the likelihood ratio. The verbal
equivalent scale (Aitken et al., 1998) is based on the logarithm for the very same reasons.
Furthermore, both the odds form of Bayes’ theorem and the formula to combine
likelihood ratios from independent pieces of evidence involve a multiplicative relationship
that becomes a handier additive relation if the logarithm is taken (Schum, 1994). Moreover,
the logarithm helps in presenting large numbers in a compact way that is easier to
comprehend, and it is symmetric with respect to the prosecution’s and defense’s hypotheses:
this may be useful if one wants to invert the weight of evidence to consider the defense’s
proposition (Aitken and Taroni, 2004).
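
The additive and symmetric behaviour of the logarithm mentioned above can be checked numerically. The following is a minimal sketch; the two likelihood ratio values are made up for illustration, and the helper name `weight_of_evidence` is ours.

```python
import math


def weight_of_evidence(lr):
    """Base-10 logarithm of a likelihood ratio."""
    return math.log10(lr)


# Hypothetical likelihood ratios for two independent pieces of evidence.
lr_first, lr_second = 1000.0, 50.0

# Multiplying likelihood ratios corresponds to adding weights of evidence.
combined_weight = weight_of_evidence(lr_first) + weight_of_evidence(lr_second)

# Swapping the roles of the two hypotheses inverts the ratio,
# which only flips the sign of the weight of evidence.
swapped_weight = weight_of_evidence(1.0 / lr_first)

print(combined_weight, swapped_weight)
```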

5 The rare Y-STR haplotype problem 

Consider the situation in which a piece of evidence is recovered at the crime scene, and a
suspect turns out to have the same analyzed characteristics (for instance the same DNA
profile) as the crime scene evidence. The prosecution claims that the suspect left the
evidence, while the defense claims that someone else (with the same DNA profile) left it. The
capability of the match to discriminate between the competing hypotheses is evaluated by
comparing how probable it is under each of the hypotheses. This depends on the proportion
of individuals possessing the same profile in the population of possible perpetrators: the
rarer the profile, the more the suspect is in trouble. This proportion is usually unknown,
the only available data being a sample of DNA profiles from the population, in the form of
a reference database. The naive estimator uses the relative frequency of the profile in the
database as an estimate for $\theta$. Problems arise when this frequency is 0, the so-called “rare
type match”. This problem is so substantial that it has been called “the fundamental
problem of forensic mathematics” by Brenner (2010). As an alternative to the empirical
frequency estimator, one can use the add-constant estimators, which add a constant to
the count of each type, including the unseen ones. The best known are the add-one
estimator, due to Laplace (1814), and the add-half estimator of Krichevsky and Trofimov
(1981). However, to use these methods one needs to know the number of possible unseen
types, and there are problems if this number is large compared to the sample size (see Gale
and Church (1994) for additional discussion). Another possibility is the ‘rule of three’,
proposed by Louis (1981). It states that $3/n$ is a good approximation of the 95% upper
bound for the frequency, if $n$ is the size of the database.
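
The estimators just mentioned can be sketched in a few lines. This is an illustration only: the function names and the toy database are ours, and the add-constant estimator is written in its usual textbook form, (count + c) / (n + c·K), where the total number of possible types K has to be supplied by the user (which is exactly the practical difficulty noted above).

```python
from collections import Counter


def naive_frequency(database, profile):
    """Relative frequency of `profile` in the database (0 for unseen types)."""
    return Counter(database)[profile] / len(database)


def add_constant_frequency(database, profile, n_types, c=1.0):
    """Add-constant estimate; c=1 is add-one, c=0.5 is add-half.

    n_types is the assumed total number of possible types (seen and unseen).
    """
    counts = Counter(database)
    return (counts[profile] + c) / (len(database) + c * n_types)


def rule_of_three_upper_bound(n):
    """Approximate 95% upper bound for the frequency of a type unseen in n draws."""
    return 3.0 / n


toy_database = ["A", "A", "B", "C"]  # hypothetical profiles
print(naive_frequency(toy_database, "D"))
print(add_constant_frequency(toy_database, "D", n_types=10))
print(rule_of_three_upper_bound(100))
```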

Of interest for this paper is the nonparametric Good-Turing estimator of Good (1953),
based on an intuition of A. M. Turing. It is an estimator for the total unobserved probability
mass which is based on the proportion of singletons in the sample. For a comparison
between the add-one and Good-Turing estimators, see Orlitsky et al. (2003).
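
In its simplest form, the Good-Turing estimate of the total unobserved probability mass is the number of types seen exactly once divided by the sample size. The sketch below implements only this basic version (not the generalization developed later in the paper); the function name and toy sample are ours.

```python
from collections import Counter


def good_turing_unseen_mass(sample):
    """Estimate the total probability of all types not present in the sample."""
    counts = Counter(sample)
    n_singletons = sum(1 for count in counts.values() if count == 1)
    return n_singletons / len(sample)


toy_sample = ["A", "A", "B", "C", "D"]  # hypothetical types; B, C, D are singletons
print(good_turing_unseen_mass(toy_sample))
```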

The naive estimator and the Good-Turing estimator are in some sense complementary
(Anevski et al., 2013): the first gives a good estimate for the observed types and the
second for the probability mass of the unobserved ones. Lastly, the high profile estimator,
introduced by Orlitsky et al. (2004), extends the tail of the naive estimator to the region
of unobserved types. This estimator has been improved by Anevski et al. (2013), who also
provide a consistency proof.

The rare type match problem is common, for instance, when a new kind of forensic
evidence is involved, for which the available database size is still limited. One
example is the case of DIP-STR markers (e.g. Cereda et al., 2014). The same happens when
Y-chromosome (or mitochondrial) DNA profiles are used: because of the lack of
recombination when offspring DNA is generated from the DNA of the parents, each
haplotype must be treated as a unit (the match probability cannot be obtained by
multiplication across loci) and the set of possible haplotypes is extremely large. As a consequence,
most of the Y-STR haplotypes are not represented in the database.

In the rest of the paper, Y-STR data will be used as an extreme but common and
important instance of the problem of assessing the evidential value of a rare type match.
The literature provides some examples of approaches to evaluate it for the rare
Y-STR haplotype match: Egeland and Salas (2008), the κ method (Brenner, 2010, 2014),
the coalescent theory method (Andersen et al., 2013a), the haplotype surveying method
(Roewer et al., 2000; Krawczak, 2001; Willuweit et al., 2011), and the discrete Laplace
method (Andersen et al., 2013b) (not directly proposed for the rare haplotype case but
usable for that purpose). As already mentioned, Cereda (2015) discusses the full Bayesian
approach to this problem.

Bayesian nonparametric estimators for the probability of observing a new type have
been proposed (e.g. Tiwari and Tripathi, 1989; Lijoi et al., 2007; Favaro et al.,
2009). However, likelihood ratio assessment requires not only the probability
of observing a new species but also the probability of observing this same species twice
(according to the defense, the crime stain profile and the suspect profile are two independent
observations). Cereda (2015b) is the first paper that addresses the problem of likelihood
ratio assessment in the rare haplotype case using Bayesian nonparametric models.

The present paper analyses two frequentist methods, the discrete Laplace method and
a generalization of the Good-Turing estimator, making explicit the corresponding definitions of $D$,
$E$, and $B$, and providing a study of the different levels of uncertainty arising for each.

6 The discrete Laplace Method 

A discrete random variable $X$ is said to follow the discrete Laplace distribution $DL(p, y)$,
with dispersion parameter $p \in (0, 1)$ and location parameter $y \in \mathbb{Z}$, if its probability
mass function is defined as

$$f(x \mid p, y) = \frac{1 - p}{1 + p}\, p^{|x - y|}, \quad \forall x \in \mathbb{Z}.$$

This is used in Andersen et al. (2013b) to model the distribution of a single-locus Y-STR
haplotype in some subpopulation, which is thus assumed to be distributed around a modal
allele (represented by the location parameter $y$).
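
A minimal sketch of this probability mass function follows, assuming the standard normalising constant $(1 - p)/(1 + p)$ of the discrete Laplace distribution. The function name and the example values are ours; the numerical check simply confirms that the masses sum to approximately one over a wide range of integers.

```python
def discrete_laplace_pmf(x, p, y):
    """Mass of the discrete Laplace distribution DL(p, y) at the integer x."""
    if not 0.0 < p < 1.0:
        raise ValueError("dispersion parameter p must lie in (0, 1)")
    return (1.0 - p) / (1.0 + p) * p ** abs(x - y)


# Hypothetical single-locus example: modal allele 14, dispersion 0.3.
mass_at_mode = discrete_laplace_pmf(14, p=0.3, y=14)
mass_one_step_away = discrete_laplace_pmf(15, p=0.3, y=14)
total_mass = sum(discrete_laplace_pmf(x, p=0.3, y=14) for x in range(-200, 229))
print(mass_at_mode, mass_one_step_away, total_mass)
```

The mass decays geometrically with the distance from the modal allele, and symmetrically on both sides.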

Each haplotype is actually composed of $r$ loci. Let us denote with $X = (X_1, X_2, \ldots, X_r)$
the random variable which describes an $r$-loci haplotype configuration. Moreover, there
may be $c$ different subpopulations to take into consideration. By making the strong
assumption of independence between loci, within the same subpopulation, the following
density is used to describe the probability that $X = x$:

$$f\big(x \mid \{y_j\}_j, \{p_j\}_j, \{\tau_j\}_j\big) = \sum_{j=1}^{c} \tau_j \prod_{k=1}^{r} f(x_k \mid y_{jk}, p_{jk}),$$

where, for each $j$, $\tau_j$ is the a priori probability of generating from the $j$th subpopulation,
while $p_j = (p_{j1}, p_{j2}, \ldots, p_{jr})$ and $y_j = (y_{j1}, y_{j2}, \ldots, y_{jr})$ represent the dispersion and location
parameters, respectively, of the $j$th subpopulation. Andersen et al. (2013b) propose to
estimate all these parameters by using the EM algorithm (Dempster et al., 1977).
The initial subpopulation centres are chosen by the PAM algorithm (Kaufman and Rousseeuw,
2009) and their number by the Bayesian Information Criterion (BIC) (Schwarz, 1978).
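
The mixture density above can be evaluated directly once the parameters are given. The sketch below does only that: it takes hypothetical, hand-picked parameter values and does not perform the EM fitting, PAM initialisation, or BIC selection. Function names are ours, and the single-locus mass assumes the standard normalising constant $(1 - p)/(1 + p)$.

```python
import math


def discrete_laplace_pmf(x, p, y):
    """Single-locus discrete Laplace mass at integer x (standard normalisation)."""
    return (1.0 - p) / (1.0 + p) * p ** abs(x - y)


def haplotype_probability(x, tau, y, p):
    """Mixture over subpopulations j of products over loci k.

    x   : haplotype, sequence of r integers
    tau : c prior subpopulation weights, summing to one
    y   : c location vectors, each of length r
    p   : c dispersion vectors, each of length r
    """
    total = 0.0
    for tau_j, y_j, p_j in zip(tau, y, p):
        per_locus = (discrete_laplace_pmf(x_k, p_jk, y_jk)
                     for x_k, y_jk, p_jk in zip(x, y_j, p_j))
        total += tau_j * math.prod(per_locus)
    return total


# Hypothetical parameters: two subpopulations, three loci.
tau = [0.6, 0.4]
y = [(14, 30, 10), (15, 29, 11)]
p = [(0.2, 0.3, 0.1), (0.25, 0.2, 0.15)]

f_x = haplotype_probability((14, 30, 11), tau, y, p)
print(f_x, math.log10(1.0 / f_x))
```

Because the model assigns positive mass to every integer-valued haplotype, a haplotype never seen in the database still receives a non-zero probability, which is what makes the method usable in the rare type match setting.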

6.1 The choice of D in the discrete Laplace Method 

The choice of $D$ which underlies the discrete Laplace method, when used to address the
rare haplotype match problem, is:

• $D_{DL}$ = the particular haplotype $x$ of the suspect and of the stain, along with a
database which is a sample from the population of possible perpetrators.


This method allows one to evaluate the data in the light of the usual hypotheses of
interest in the DNA example (see Section 2). $D_{DL}$ can be split into $E_{DL}$ and $B_{DL}$ in the
following way:

• $E_{DL}$ = the particular haplotype $x$ of the stain ($E_t$) and of the suspect ($E_s$).

• $B_{DL}$ = random sample from the population of possible perpetrators (i.e. database).


The vector containing the frequencies of all haplotypes in the population of reference can
be thought of as the nuisance parameter $\theta$ of this model. Conditions a and b presented
in Section 2 are valid for $E_{DL}$, $B_{DL}$, $\theta$, and $h$, thus the following likelihood ratio (where $\theta$
is again omitted) corresponds to this choice of data, evidence, background, and model:


Pr(D dl = d | h p ) _ ffr(E t = x \ E s = x, h p )Pr(E s = x \ h p ) 
Pr(D DL = d\h d ) Tr(E t = x \ E s = x, h d )(Pr(E s = x \ h d ) 
Tr(E t = x | E s = x, h p ) 1 
Pr{E t = x\ h d ) f x 


Here, $f_x$ is the frequency of the haplotype $x$ in the population of reference. The second equality is due to conditions a and b discussed in Section 2.2, while the third one is justified by the fact that the distribution of the haplotype of the suspect does not depend on which hypothesis is correct, and that, when $\theta$ is fixed (as in the frequentist approach which we are considering) and under $h_d$, $E_t$ is independent of $E_s$. The weight of evidence is thus

$$\log_{10} \mathrm{LR}_{DL} = \log_{10} \frac{1}{f_x}. \qquad (7)$$

The frequency $f_x$ can be estimated by $\hat f_x$, using the discrete Laplace method. This leads to the following plug-in estimator for $\log_{10} \mathrm{LR}_{DL}$:

$$\log_{10} \widehat{\mathrm{LR}}_{DL} = \log_{10} \frac{1}{\hat f_x}.$$

Notice that the discrete Laplace method uses the database to estimate the number of subpopulations and all the parameters in the model, and this is where $B_{DL}$ comes into play again.

6.2 Quantifying the uncertainty of the discrete Laplace method 

We quantify the uncertainty of this method by comparing the distribution of $\log_{10} \widehat{\mathrm{LR}}_{DL} = \log_{10}(1/\hat f_x)$ with the distribution of the “true” $\log_{10} \mathrm{LR}_{DL} = \log_{10}(1/f_x)$. The frequency $f_x$ is not known, but we have a database of approximately 19,000 Y-STR 23-loci profiles from 129 different locations in 51 countries in Europe (Purps et al., 2014)¹, which we can pretend contains the whole population of interest for our case. We will consider only 7 loci out of 23 and perform the following experiment: we sample a small database of size $N = 100$, along with a new haplotype (not observed in the small database), and calculate the estimate $\log_{10}(1/\hat f_x)$.

Then, we can use the relative frequency of the haplotype $x$ in the big database as the true one, $f_x$, to obtain $\log_{10}(1/f_x)$.

This process can be repeated many times (for instance $M = 1000$ samplings of small databases of size $N = 100$ and, for each, a previously unobserved haplotype).

In estimating $\log_{10} \mathrm{LR}_{DL}$ via $\hat f_x$, one has the choice between adding the haplotype $x$ to the small database before estimating the parameters of the discrete Laplace distribution, or not. In a full Bayesian approach the right thing to do is to add the profile to the database. This is shown in Cereda (2015), and we believe that it is also the right thing to do in a frequentist framework. In practice, experiments show that adding the haplotype to the database or not makes little difference.
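The sampling experiment described above can be sketched as follows. Several choices here are assumptions rather than details given in the text: individuals are drawn without replacement from the big database, a draw is rejected until the extra haplotype is absent from the small database, and `estimate_frequency` is a hypothetical hook where a fitted discrete Laplace model would go (the sketch does not implement that fit; `placeholder_estimator` is a crude stand-in used only to make the code run).

```python
import math
import random
from collections import Counter


def sample_rare_match_case(population, n, rng):
    """Draw a database of size n and one extra haplotype not observed in it."""
    while True:
        draw = rng.sample(population, n + 1)  # assumption: sampling without replacement
        database, x = draw[:n], draw[n]
        if x not in set(database):            # keep only rare type match cases
            return database, x


def simulate_errors(population, estimate_frequency, n=100, m=1000, seed=0):
    """Return m values of log10(1/f_hat) - log10(1/f_x)."""
    rng = random.Random(seed)
    counts = Counter(population)
    errors = []
    for _ in range(m):
        database, x = sample_rare_match_case(population, n, rng)
        f_true = counts[x] / len(population)  # big database treated as the whole population
        f_hat = estimate_frequency(database, x)
        errors.append(math.log10(1 / f_hat) - math.log10(1 / f_true))
    return errors


def placeholder_estimator(database, x):
    """Stand-in estimator (add-one relative frequency), NOT the discrete Laplace method."""
    return (database.count(x) + 1) / (len(database) + 1)
```

Haplotypes can be any hashable objects, for instance tuples of allele counts; summary statistics of the returned list give the kind of figures reported for the error.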

Table 1 and Figure 2 (left part) compare the distributions of $\log_{10} \mathrm{LR}_{DL}$ and $\log_{10} \widehat{\mathrm{LR}}_{DL}$, using 7 loci. The same experiment has been carried out for 10 and 3 loci, but the results are not reported in detail.



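| | Min | 1st Qu. | Median | Mean | 3rd Qu. | Max | s.d. |
|---|---|---|---|---|---|---|---|
| $\log_{10} \mathrm{LR}_{DL}$ | 1.305 | 2.733 | 3.277 | 3.272 | 3.800 | 4.277 | 0.666 |
| $\log_{10} \widehat{\mathrm{LR}}_{DL}$ | 1.432 | 3.441 | 4.061 | 4.114 | 4.750 | 8.452 | 1.017 |
| Error $e_{DL}$ | -1.37 | 0.217 | 0.807 | 0.842 | 1.39 | 4.476 | 0.863 |

Table 1: Summaries of the distribution of $\log_{10} \mathrm{LR}_{DL}$, of $\log_{10} \widehat{\mathrm{LR}}_{DL}$, and of the error $e_{DL}$.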


The error of the discrete Laplace method can be defined as $e_{DL} := \log_{10} \widehat{\mathrm{LR}}_{DL} - \log_{10} \mathrm{LR}_{DL}$. It measures how far the estimate is from the true value. Table 1 and Figure 2 (right part) show the distribution of the error. One can see that it can attain up to about 4 orders of magnitude. The distribution of the error is mostly located on positive values, which means that, more often than not, $\log_{10} \widehat{\mathrm{LR}}_{DL}$ overestimates $\log_{10} \mathrm{LR}_{DL}$. The standard deviation of the error is small, so $e_{DL}$ does not move too far from its mean, which is about 0.842.

[Figure 2, panels: (a) boxplots labelled “DL estimated” and “DL true”; (b) boxplot labelled “Error DL”.]

Figure 2: Discrete Laplace method. Boxplots comparing the distributions of $\log_{10} \widehat{\mathrm{LR}}_{DL}$ and $\log_{10} \mathrm{LR}_{DL}$ (left) and of the error $e_{DL} = \log_{10} \widehat{\mathrm{LR}}_{DL} - \log_{10} \mathrm{LR}_{DL}$ (right).

¹ A clean version of the database is provided by Mikkel Meyer Andersen (http://people.math.aau.dk/~mikl/?p=y23).

Motivated by the discussion of Section 4, we now analyze the different levels of uncertainty which affect the error. The second level of uncertainty is introduced when the discrete Laplace model, along with its whole set of assumptions, is chosen to model the distribution of single-locus haplotypes, which in reality do not follow a discrete Laplace distribution.

The third level of uncertainty pertains to the estimation of the parameters of the model $(c, p, y, \tau)$. Here, the databases used to estimate the parameters of the discrete Laplace model are probably too small ($N = 100$) with regard to 7 loci.

To decrease both sources of error, one can reduce the number of analyzed loci to 3. The population becomes less sparse, and the databases become big enough. Indeed, we performed this experiment and the error decreased a great deal. However, the basic level of uncertainty (see Section 4) is increased inasmuch as the data become less effective at discriminating between the two hypotheses. On the other hand, the same experiment with 10 loci yields likelihood ratios that are more powerful but less precise.

The second level of uncertainty can be made harmless by assuming an infinite number of subpopulations, since in this way the model will perfectly fit any population. However, this solution will increase the number of parameters, and with it the third level of uncertainty.

It is worth underlining that the results of our simulations do not mean that the discrete Laplace method is wrong on the whole, but they show that a blind use of this method is dangerous. We are applying this method to the specific case of the rare haplotype match, using a database of size 100 and a rather sparse population: this method may never have been intended for such small databases, and it may be possible to modify it in more clever ways for that purpose.

7 The Generalized-Good method 

Based on Good (1953), we now propose a nonparametric estimator for the weight of evidence. This is a very good example of data reduction, since $D$ is here reduced to a greater extent than it was for the discrete Laplace method. Indeed, the specific haplotype $x$ of the crime stain and of the suspect is ignored, retaining only the fact that they match and the fact that this profile has not yet been observed in the database. Stated otherwise,

• $D_{GG}$ = the haplotype of the suspect matches the haplotype of the crime stain and it is not in the database.

Consider the following mathematical description: the database of size $N$ can be seen as an i.i.d. sample $(Y_1, Y_2, \ldots, Y_N)$ from species $\{1, 2, \ldots, s\}$, with probabilities $(p_1, p_2, \ldots, p_s)$. Hence, the suspect’s profile can be thought of as the $(N+1)$st i.i.d. observation. The crime stain’s profile is the $(N+2)$nd observation. According to the defense it is again an i.i.d. draw from $(p_1, p_2, \ldots, p_s)$, while according to the prosecution it is equal to the value of $Y_{N+1}$ with probability one.

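The likelihood ratio for this reduction of the data can thus be written as

$$\mathrm{LR}_{GG} = \frac{\Pr(Y_{N+1} \notin \mathcal{Y}_N,\ Y_{N+1} = Y_{N+2} \mid h_p)}{\Pr(Y_{N+1} \notin \mathcal{Y}_N,\ Y_{N+1} = Y_{N+2} \mid h_d)} = \frac{\Pr(Y_{N+1} \notin \mathcal{Y}_N \mid h_p)}{\Pr(Y_{N+1} \notin \mathcal{Y}_N,\ Y_{N+1} = Y_{N+2} \mid h_d)},$$

where $\mathcal{Y}_N$ denotes the database $(Y_1, Y_2, \ldots, Y_N)$.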

From now on, we present results for a general database size $N > 2$ and general random variables $Y_1, Y_2, \ldots, Y_N$ i.i.d. from $(p_1, p_2, \ldots, p_s)$. The following notation is used:

$$\theta_1(N; p_1, p_2, \ldots, p_s) := \Pr(Y_N \notin \{Y_1, Y_2, \ldots, Y_{N-1}\}),$$
$$\theta_2(N; p_1, p_2, \ldots, p_s) := \Pr(Y_N \notin \{Y_1, Y_2, \ldots, Y_{N-2}\},\ Y_N = Y_{N-1}).$$

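To make the notation less cumbersome we will use

$$\mathcal{Y}_N := (Y_1, Y_2, \ldots, Y_N),$$
$$\mathcal{Y}_{i,N} := (Y_1, Y_2, \ldots, Y_{i-1}, Y_{i+1}, \ldots, Y_N),$$
$$\mathcal{Y}_{(i,j),N} := (Y_1, Y_2, \ldots, Y_{i-1}, Y_{i+1}, \ldots, Y_{j-1}, Y_{j+1}, \ldots, Y_N), \quad \forall\, i < j.$$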

Moreover, for any random variable $Y$ and any pair of sets $A$ and $B$, $\mathbf{1}_{A \cap B^c}(Y)$ is a random variable which has value 1 if $Y$ belongs to the set $A$ and not to the set $B$, and zero otherwise.

Theorem 1. An unbiased estimator for $\theta_1(N; p_1, p_2, \ldots, p_s)$ is $\hat\theta_1(N) = N_1/N$, where $N_1$ is the number of singletons in the database.

Proof.

$$\theta_1(N; p_1, p_2, \ldots, p_s) = \Pr(Y_N \notin \mathcal{Y}_{N-1}) = E\bigl(\mathbf{1}_{(\mathcal{Y}_{N-1})^c}(Y_N)\bigr) = \frac{1}{N} \sum_{i=1}^{N} E\bigl(\mathbf{1}_{(\mathcal{Y}_{i,N})^c}(Y_i)\bigr) = E\Bigl(\frac{N_1}{N}\Bigr),$$

where the last equality is due to the fact that the function $\mathbf{1}_{(\mathcal{Y}_{i,N})^c}(Y_i)$ has value 1 for every singleton of the database: the sum is thus the number of singletons ($N_1$). □

Theorem 2. An unbiased estimator for $\theta_2(N; p_1, p_2, \ldots, p_s)$ is $\hat\theta_2(N) = 2N_2/(N(N-1))$, where $N_2$ is the number of doubletons in the database.

Proof.

$$\theta_2(N; p_1, p_2, \ldots, p_s) = \Pr(Y_N \notin \mathcal{Y}_{N-2},\ Y_N = Y_{N-1}) = E\bigl(\mathbf{1}_{\{Y_{N-1}\} \cap (\mathcal{Y}_{N-2})^c}(Y_N)\bigr) = E\Bigl(\frac{2}{N(N-1)} \sum_{i<j} \mathbf{1}_{\{Y_j\} \cap (\mathcal{Y}_{(i,j),N})^c}(Y_i)\Bigr) = E\Bigl(\frac{2N_2}{N(N-1)}\Bigr),$$

where the last equality is due to the fact that the function $\mathbf{1}_{\{Y_j\} \cap (\mathcal{Y}_{(i,j),N})^c}(Y_i)$ has value 1 for each of the $N_2$ doubletons of the database. □
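Theorems 1 and 2 can be checked numerically on a toy example. The sketch below is not part of the proofs: it enumerates every possible sequence for a hypothetical population of three species with $N = 4$, and compares $\theta_1$ and $\theta_2$ with the exact expectations of $N_1/N$ and $2N_2/(N(N-1))$. The probabilities and function names are made up for illustration.

```python
from collections import Counter
from itertools import product
from math import prod


def singletons_doubletons(sample):
    """Return (N1, N2): the numbers of species seen exactly once and exactly twice."""
    multiplicities = Counter(Counter(sample).values())
    return multiplicities[1], multiplicities[2]


def check_unbiasedness(probs, n):
    """Compare theta_1, theta_2 with the exact expectations of their estimators.

    Enumerates all len(probs)**n sequences, so it is only feasible for tiny cases (n >= 2).
    """
    theta1 = theta2 = e_hat1 = e_hat2 = 0.0
    for seq in product(range(len(probs)), repeat=n):
        w = prod(probs[s] for s in seq)          # probability of this i.i.d. sequence
        last = seq[-1]
        if last not in seq[:-1]:
            theta1 += w                           # Y_N not among Y_1, ..., Y_{N-1}
        if last == seq[-2] and last not in seq[:-2]:
            theta2 += w                           # Y_N = Y_{N-1}, not among Y_1, ..., Y_{N-2}
        n1, n2 = singletons_doubletons(seq)
        e_hat1 += w * n1 / n
        e_hat2 += w * 2 * n2 / (n * (n - 1))
    return theta1, e_hat1, theta2, e_hat2


theta1, e_hat1, theta2, e_hat2 = check_unbiasedness([0.5, 0.3, 0.2], n=4)
```

On this example the two pairs of numbers agree up to floating-point rounding, as the theorems predict.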


The two previous theorems can be easily generalized to $\theta_m$, defined as $\theta_m(N; p_1, p_2, \ldots, p_s) := \Pr(Y_N \notin \mathcal{Y}_{N-m},\ Y_N = Y_{N-1} = \cdots = Y_{N-m+1})$.

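Now we can estimate $\log_{10} \mathrm{LR}_{GG}$ in the following way:

$$\log_{10} \mathrm{LR}_{GG} = \log_{10} \frac{\Pr(Y_{N+1} \notin \mathcal{Y}_N \mid h_p)}{\Pr(Y_{N+1} \notin \mathcal{Y}_N,\ Y_{N+1} = Y_{N+2} \mid h_d)} \approx \log_{10} \frac{\Pr(Y_N \notin \mathcal{Y}_{N-1})}{\Pr(Y_N \notin \mathcal{Y}_{N-2},\ Y_N = Y_{N-1})} = \log_{10} \frac{\theta_1(N; p_1, p_2, \ldots, p_s)}{\theta_2(N; p_1, p_2, \ldots, p_s)}.$$

Thus, we propose the following estimator for the weight of evidence:

$$\log_{10} \widehat{\mathrm{LR}}_{GG} = \log_{10} \frac{\hat\theta_1(N)}{\hat\theta_2(N)} = \log_{10} \frac{(N-1) N_1}{2 N_2} \approx \log_{10} \frac{N N_1}{2 N_2}. \qquad (8)$$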

Notice that there are two kinds of approximation steps: a mathematical approximation of $\theta_1(N+1; p_1, p_2, \ldots, p_s)$ with $\theta_1(N; p_1, p_2, \ldots, p_s)$, which should hardly make any difference for reasonably large $N$, and a statistical estimation of $\theta_1(N; p_1, p_2, \ldots, p_s)$ using an unbiased estimator (and similarly for $\theta_2$).

It is important to underline that, due to Jensen’s inequality, the estimators $\log_{10} \hat\theta_1$ and $\log_{10} \hat\theta_2$ are not unbiased for $\log_{10} \theta_1$ and $\log_{10} \theta_2$, but it will be shown by simulations that $\log_{10} \widehat{\mathrm{LR}}_{GG}$ is approximately unbiased for $\log_{10} \mathrm{LR}_{GG}$. However, the point is not to find an unbiased estimator, but one with a small error.

Notice that in order to estimate $\log_{10} \mathrm{LR}_{GG}$ it is not necessary to use all the information contained in the database, but only $N$, $N_1$ and $N_2$, where $N_1$ and $N_2$ are the numbers of singletons and doubletons in the database. The nuisance parameter of the model is the vector $\theta$ containing the frequencies of the Y-STR haplotypes in the population of interest. $\theta_1$ and $\theta_2$ are functions of $\theta$.
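Since only $N$, $N_1$ and $N_2$ are needed, the estimator $\log_{10}(N N_1 / (2 N_2))$ is straightforward to compute from a database. The following is a minimal sketch; the function name is hypothetical, and returning `None` when $N_1 = 0$ or $N_2 = 0$ is a choice made here for illustration, not something prescribed by the text.

```python
import math
from collections import Counter


def log10_lr_gg(database):
    """Generalized-Good weight of evidence, log10(N * N1 / (2 * N2)).

    `database` is a sequence of hashable haplotypes. Returns None when the
    estimate is not finite (N1 == 0 or N2 == 0).
    """
    n = len(database)
    multiplicities = Counter(Counter(database).values())
    n1, n2 = multiplicities[1], multiplicities[2]  # singletons and doubletons
    if n1 == 0 or n2 == 0:
        return None
    return math.log10(n * n1 / (2 * n2))


# Hypothetical database of size 100: 90 singletons and 5 doubletons.
toy_database = list(range(90)) + [100 + i for i in range(5) for _ in range(2)]
toy_estimate = log10_lr_gg(toy_database)
```

For the toy database the estimate is $\log_{10}(100 \cdot 90 / 10) = \log_{10} 900$.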

The limitation of this method is that it cannot be used if $N_2 = 0$ (this corresponds to an infinite likelihood ratio), and it does not perform well when the number of singletons is very small or zero. We believe it can be improved and extended by smoothing techniques (Good, 1953; Anevski et al., 2013), but we do not pursue this here.

The ‘$\kappa$-method’ of Brenner (Brenner, 2010) is based on an analogous line of reasoning. It estimates the likelihood ratio as $\mathrm{LR}_\kappa = N^2/(N - N_1)$. However, in the derivation of this estimator, there is an approximation involved, based on assumptions which are not always satisfied, leading sometimes to anti-conservatism (see also the discussion in Buckleton et al. (2011), and the answer in Brenner (2014)). In particular, Brenner (2014) provides a pathological population where the approximation does not hold, while showing empirical evidence that for Fisher-Wright populations the condition is fulfilled. Our method is, on the other hand, based on a principled derivation of the estimator of equation (8), which is similar to Brenner’s under the following conditions: there are almost only singletons and doubletons in the database, and $N_1 \gg N_2$. These assumptions are typically satisfied, explaining why Brenner’s method often works. They also constitute a good description of when it does not work.

Lastly, we remark that this method can be generalized in the obvious way, to the case 
in which the haplotype is indeed in the database. Moreover, this method is suitable to be 
directly applied to different kinds of evidence. 

7.1 Quantifying the uncertainty of the GG method 

As we did in Section 6.2, we want to quantify the uncertainty of this method. One way is to compare the distribution of

$$\log_{10} \widehat{\mathrm{LR}}_{GG} = \log_{10} \frac{N N_1}{2 N_2},$$

with the distribution of the “true”

$$\log_{10} \mathrm{LR}_{GG} = \log_{10} \frac{\Pr(Y_{N+1} \notin \mathcal{Y}_N)}{\Pr(Y_{N+1} \notin \mathcal{Y}_N,\ Y_{N+1} = Y_{N+2})} = \log_{10} \frac{\theta_1}{\theta_2}.$$

Actually, the latter is not a distribution, but a single, unknown value. Again, we pretend that the database of Purps et al. (2014) contains the profiles of the whole population, to find out the ‘true’ $\theta_1$ and $\theta_2$, restricting our simulations to 7 loci. To do so, we sample $M$ small databases of size $N = 100$, along with two other haplotypes. $\theta_1$ is the proportion of times in which the $(N+1)$st haplotype is a new one (i.e., not one of the previous $N$), and $\theta_2$ is the proportion of times in which the $(N+2)$nd is equal to the $(N+1)$st, and different from the first $N$ observations. In our simulations we used $M = 100{,}000$ and obtained $\theta_1$, $\theta_2$, and $\log_{10} \mathrm{LR}_{GG}$ as in Table 2.


| $\theta_1$ | $\theta_2$ | True $\log_{10} \mathrm{LR}_{GG}$ |
|---|---|---|
| 0.748 | 0.0012 | 2.78 |

Table 2: Values of $\theta_1$ and $\theta_2$ and of $\log_{10} \mathrm{LR}_{GG}$ obtained by simulations, assuming that the database of Purps et al. (2014) contains the whole population of interest.

The distribution of $\log_{10} \widehat{\mathrm{LR}}_{GG} = \log_{10} \frac{N N_1}{2 N_2}$ can be obtained by sampling $M = 100{,}000$ databases of size $N = 100$. Out of 100,000 databases, 121 had $N_2 = 0$. They have been removed from the data, and we acknowledge that this choice is unfair to the discrete Laplace method. On the other hand, we believe that this occurs rarely enough not to affect the comparison very strongly.
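The simulation used to obtain the ‘true’ $\theta_1$ and $\theta_2$ can be sketched as follows. This is an illustration under assumptions: haplotypes are drawn i.i.d. (with replacement) from a list standing in for the whole population, and the function name and defaults are hypothetical.

```python
import random


def simulate_thetas(population, n=100, m=100_000, seed=0):
    """Monte Carlo estimates of theta_1 and theta_2.

    theta_1: proportion of runs in which the (n+1)st haplotype is not among the first n.
    theta_2: proportion of runs in which, in addition, the (n+2)nd equals the (n+1)st.
    """
    rng = random.Random(seed)
    new_count = match_count = 0
    for _ in range(m):
        draw = rng.choices(population, k=n + 2)  # assumption: i.i.d. draws with replacement
        seen = set(draw[:n])
        if draw[n] not in seen:
            new_count += 1
            if draw[n + 1] == draw[n]:
                match_count += 1
    return new_count / m, match_count / m
```

The ‘true’ weight of evidence is then $\log_{10}$ of the ratio of the two returned values, provided the second one is not zero.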



Figure 3: Boxplots of the distribution of $\log_{10} \widehat{\mathrm{LR}}_{GG}$ around the true value $\log_{10} \mathrm{LR}_{GG}$ (black line).

Figure 3 shows the distribution of the estimator $\log_{10} \widehat{\mathrm{LR}}_{GG}$ around the true value (black line). The error of the Generalized-Good method, defined as $e_{GG} = \log_{10} \widehat{\mathrm{LR}}_{GG} - \log_{10} \mathrm{LR}_{GG}$, tells us how much the estimator differs from the true value.

Table 3 provides the summaries for $\log_{10} \widehat{\mathrm{LR}}_{GG}$ and for the error $e_{GG}$. We do not provide plots for the distribution of $e_{GG}$, since they are identical to those in Figure 3, shifted by $\log_{10} \mathrm{LR}_{GG}$.

| | Min | 1st Qu. | Median | Mean | 3rd Qu. | Max | s.d. |
|---|---|---|---|---|---|---|---|
| $\log_{10} \mathrm{LR}_{GG}$ | 2.78 | 2.78 | 2.78 | 2.78 | 2.78 | 2.78 | 0 |
| $\log_{10} \widehat{\mathrm{LR}}_{GG}$ | 2.215 | 2.682 | 2.792 | 2.818 | 2.920 | 3.668 | 0.198 |
| Error $e_{GG}$ | -0.566 | -0.098 | 0.0112 | 0.038 | 0.14 | 0.887 | 0.198 |

Table 3: Summaries of the distribution of $\log_{10} \widehat{\mathrm{LR}}_{GG}$, of $\log_{10} \mathrm{LR}_{GG}$, and of the error $e_{GG}$.

One can see that the error can attain up to about 0.9 orders of magnitude. The distribution of the error is mostly located on positive values, which means that, more often than not, $\log_{10} \widehat{\mathrm{LR}}_{GG}$ overestimates $\log_{10} \mathrm{LR}_{GG}$. The standard deviation of the error is small, so $e_{GG}$ does not move too far from its mean, which is about 0.038. Compared with the error of the discrete Laplace method, one can conclude that here we get a better estimator in terms of accuracy, since the error ranges over a narrower interval and the standard deviation is much smaller. However, it is important to keep in mind that these are not different estimators of the same quantity, but different estimators of different quantities: the reduction of data used by the Generalized-Good method, which makes the estimates more accurate, also makes the data less powerful for discriminating between the two hypotheses.

8 Choosing and comparing methods 

In comparing the two methods one can consider the precision with respect to what each method is trying to estimate, quantified by the errors $e_{DL}$ and $e_{GG}$. These errors are due to the second and third levels of uncertainty described in Section 4, and decrease appreciably if the data are reduced. This is why, in this respect, the Generalized-Good method is to be preferred to the discrete Laplace method, and for the latter a smaller number of loci is to be preferred. However, it is not correct to believe that the greater the reduction, the better the method. To reduce means to lose information, and thus to diminish the capability of the method to distinguish between the hypotheses at stake (the first, or basic, level of uncertainty). In order to investigate this loss, one can compare each method to the likelihood ratio $1/f$ (where $f$ is the population frequency of the matching haplotype), which can be considered the benchmark in a population with no substructure. Comparing Table 1 with Table 2 one can see, for instance, that by choosing the Generalized-Good method one loses on average around 0.5 (on the logarithmic scale) in terms of the strength of the data to discriminate between hypotheses. This is a small disadvantage for the prosecution, while everybody gains in terms of precision with respect to the true $\log_{10} \mathrm{LR}_{GG}$. As a last remark, we invite the reader to note that the discrete Laplace method is better inasmuch as it can always be used. On the other hand, for the Generalized-Good method, we had to remove 121 experiments where $N_2 = 0$.
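The trade-off can be made concrete with a few lines of arithmetic on the reported summaries. The snippet below only re-uses numbers from Tables 1, 2 and 3; the variable names are ours, and the conversion of the loss to a multiplicative factor is an added illustration.

```python
# Values copied from Tables 1, 2 and 3.
mean_true_log10_lr_dl = 3.272            # Table 1: mean of the true log10 LR_DL
true_log10_lr_gg = 2.78                  # Table 2: true log10 LR_GG
sd_error_dl, sd_error_gg = 0.863, 0.198  # Tables 1 and 3: s.d. of the errors

# Average loss of discriminating power when moving from 1/f to the GG reduction.
loss = mean_true_log10_lr_dl - true_log10_lr_gg
loss_factor = 10 ** loss                 # the same loss on the likelihood ratio scale

# How much more spread out the discrete Laplace error is.
sd_ratio = sd_error_dl / sd_error_gg

print(f"loss = {loss:.3f}, factor = {loss_factor:.1f}, s.d. ratio = {sd_ratio:.1f}")
```

The loss of about 0.5 on the logarithmic scale corresponds to a likelihood ratio roughly three times smaller, while the standard deviation of the error is more than four times smaller for the Generalized-Good method.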

9 Remark and conclusion 

At first sight, the aim of this paper could be taken to be that of offering two additional frequentist methods to address the issue of the likelihood ratio calculation in the case of a rare Y-STR haplotype match. However, a careful reader may have realized that these methods also constitute two interesting opportunities to show and apply the guidelines set out in the opening sections. In particular, two important facts are pointed out in Sections 3 and 4: first, it is more sensible to talk about “a” likelihood ratio instead of “the” likelihood ratio, and second, a quantification of the error involved in the estimation is to be provided along with the estimate of the likelihood ratio.

Moreover, it is explained that sometimes it is possible to break down the data to be evaluated into $E$ (which is sufficient for $H$) and $B$ (which is irrelevant for $H$). The discrete Laplace method (developed in Section 6) is a good example where this distinction can be made, while the same is not true for the Generalized-Good method (Section 7).

Lastly, this paper wants to get across the message that reducing the data more strongly is sometimes not only necessary, but also desirable in terms of the precision of the estimates, as shown by the comparison between the discrete Laplace method (less reduction, less precision of the estimates) and the Generalized-Good method (stronger reduction, more precision of the estimates). In this respect we disagree with Buckleton et al. (2011) who, talking about Brenner’s method, state that “there is a merit focussing in the type or name of a lineage marker”. Although we agree that “such ignorance of type implies a substantial loss of information”, it may allow a large gain in precision.

The take-home message is that choosing the best method is clearly a very delicate task. One has to consider many different aspects, and look for a compromise which is acceptable for the specific application at hand. It is important to realise that in this paper we study a very extreme situation with very small databases and a possibly unrealistic population, for which the Generalized-Good method seemed to be the best compromise. Clearly, no general conclusions can be drawn, other than that in each new situation one has to reconsider all these aspects and weigh them.